Appendix E — Assignment E

Instructions

  1. You may talk to a friend and discuss the questions and potential directions for solving them. However, you must write your own solutions and code separately, not as a group activity.

  2. Do not write your name on the assignment.

  3. Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

  4. Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  5. The assignment is worth 100 points, and is due on Tuesday, 7th March 2023 at 11:59 pm.

  6. Five points are for properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (1 pt). If you have a Quarto issue, you must mention the issue and quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file.
  • No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission (1 pt)
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
  • Final answers of each question are written in Markdown cells (1 pt).
  • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

Energy model

The datasets ENB2012_Train.csv and ENB2012_Test.csv provide data on energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. Below is the description of the data columns:

  1. X1: Relative Compactness
  2. X2: Surface Area
  3. X3: Wall Area
  4. X4: Roof Area
  5. X5: Overall Height
  6. X6: Orientation
  7. X7: Glazing Area
  8. X8: Glazing Area Distribution
  9. y1: Heating Load

E.1 E.1.1

Suppose that we want to implement the best subset selection algorithm to find the first-order predictors (X1-X8) that can predict heating load (y1). How many models for E(y1) are possible, if the model includes (i) one variable, (ii) three variables, and (iii) eight variables? Show your steps without running any code.

Note: The notation E(y1) means the expected value of y1, or the mean of y1.

(3 points)
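Although the question asks you to show the steps by hand, the counts can be sanity-checked with binomial coefficients; a minimal sketch:

```python
from math import comb

# A model with exactly k of the 8 candidate predictors corresponds to one
# size-k subset of {X1, ..., X8}, so the count is "8 choose k".
for k in (1, 3, 8):
    print(f"{k} variable(s): {comb(8, k)} models")
```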

E.2 E.1.2

Implement the best subset selection algorithm to find the best first-order predictors of heating load y1. Print out the model summary.

Note:

  1. Use ENB2012_Train.csv and consider only the first-order terms.

  2. Use the BIC criterion for model selection.

(4 points)
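For orientation, here is one hedged sketch of the exhaustive search with a BIC criterion. It uses synthetic data and a hand-rolled BIC, n·ln(RSS/n) + p·ln(n), in plain numpy rather than the assignment's ENB2012_Train.csv, so treat the data, names, and exact formula as assumptions:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))                      # stand-in for the X1-X8 columns
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)  # y1 truly depends on two of them

def bic(cols):
    # OLS fit with intercept, scored by BIC = n*ln(RSS/n) + p*ln(n)
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    return n * np.log(rss / n) + A.shape[1] * np.log(n)

# Best subset selection: score every non-empty subset, keep the BIC minimizer
best = min(
    ((bic(list(S)), S) for k in range(1, 5) for S in combinations(range(4), k)),
    key=lambda t: t[0],
)
print(best[1])  # indices of the BIC-optimal subset
```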

E.3 E.1.3

Should R-squared be used to select among a set of models with different numbers of predictors? Justify your answer.

(1 point for answer, 2 points for justification)

E.4 E.1.4

Calculate the RMSE of the model found in E.2. Compare it with the RMSE of the model using all first-order predictors. You will find that the two RMSEs are similar. Seems like the best subset model didn’t help improve prediction.

  1. Why did the best subset model not help improve prediction accuracy as compared to the model with all the predictors?

  2. Does the best subset model provide more accurate inference as compared to the model with all the predictors? Why or why not?

Hint: Very Important Fact!

(2 points for computing the RMSEs, 3 + 3 points for justifications)
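Whichever models you compare, the RMSE itself is just the root of the mean squared residual; a minimal helper (the name rmse is my own, not from the assignment):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: sqrt of the average squared residual
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([1, 2, 3], [1, 2, 5]))  # sqrt(4/3)
```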

E.5 E.1.5

Let us consider adding all the 2-factor interactions of the predictors in the model. Answer the following questions without running code.

  1. How many predictors do we have in total?

  2. Assume best subset selection is used. How many models are fitted in total?

  3. Assume forward selection is used. How many models are fitted in total?

  4. Assume backward selection is used. How many models are fitted in total?

  5. How many models will be developed in the iteration that contains exactly 10 predictors in each model, for best subset, forward, and backward selection?

  6. Which approach would you choose for variable selection (amongst best subset, forward selection, and backward selection)?

(6 × 2 = 12 points)
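The standard closed-form counts can be checked numerically. The formulas below are the usual ones for exhaustive and stepwise selection (my assumptions, to be justified in your answer), with p = 8 first-order terms plus C(8,2) = 28 two-factor interactions:

```python
from math import comb

p = 8 + comb(8, 2)               # 8 first-order terms + 28 two-factor interactions
best_subset = 2 ** p             # best subset: every subset of the p terms is a model
stepwise = 1 + p * (p + 1) // 2  # forward/backward: one starting model, then p + (p-1) + ... + 1 fits
models_with_10 = comb(p, 10)     # best subset's iteration with exactly 10 predictors
print(p, best_subset, stepwise, models_with_10)
```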

E.6 E.1.6

Use forward selection to find the best first-order predictors and 2-factor interactions of the predictors of y1 (Heating Load). Print out the model summary.

(5 points)

E.7 E.1.7

Calculate the RMSE of the model found in E.1.6. Compare it with:

  1. the RMSE of the model you found in E.1.2, and

  2. the RMSE of the model using all the predictors and all their 2-factor interaction terms.

Among the three models (the model developed in E.1.2, the model developed in E.1.6, and the model that has all the predictors and all their 2-factor interactions), discuss which model you would prefer for prediction, which model you would prefer for inference, and why.

(2 points for computing the RMSEs, 3 + 3 points for discussion)

E.8 E.1.8

Assume that we found another dataset of 32 variables on the same set of 768 buildings (542 for training) that we would want to add into our model. We want to find the “best” model of all 40 predictors and their 2-factor interaction terms. Would you choose forward or backward selection? Justify your answer.

(1 point for answer, 4 points for justification)

Planetary radius model

See https://exoplanetarchive.ipac.caltech.edu (for context/source). We are using the Composite Planetary Systems dataset.

E.9 E.2.1

Say we’re interested in modeling the radius of exoplanets, which is named pl_rade in the data. Note that the variable pl_rade captures the radius of each planet as a proportion of Earth’s radius, which is approximately 6,378.1370 km.

Develop a linear regression model to predict pl_rade using all the variables in train_CompositePlanetarySystems.csv except pl_name, disc_facility and disc_locale. Find the RMSE (Root mean squared error) of the model on test1_CompositePlanetarySystems.csv and test2_CompositePlanetarySystems.csv.

(4 points)

E.10 E.2.2

Develop a ridge regression model to predict pl_rade using all the variables in train_CompositePlanetarySystems.csv except pl_name, disc_facility and disc_locale. What is the optimal value of the tuning parameter λ?

Hint: You may use the following grid of lambda values to find the optimal λ: alphas = 10**np.linspace(2,0.5,200)*0.5

Remember to standardize the data before fitting the ridge regression model.

(5 points)
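The assignment route would presumably run sklearn's ridge over the given alphas grid; separately, to illustrate why the standardization step matters, here is the closed-form ridge solution on synthetic, deliberately mis-scaled data (all names and numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 0.1])  # wildly different column scales
y = X @ np.array([1.0, 0.2, 5.0]) + rng.normal(size=100)

# Standardize so a single penalty lambda treats every coefficient comparably
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

def ridge_beta(lam):
    # Closed-form ridge estimate on standardized data: (X'X + lam*I)^(-1) X'y
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)

for lam in (0.0, 10.0, 1000.0):
    print(lam, np.round(ridge_beta(lam), 3))  # coefficients shrink as lambda grows
```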

E.11 E.2.3

Use the optimal value of λ found in the previous question to develop a ridge regression model. What is the RMSE of the model on test1_CompositePlanetarySystems.csv and test2_CompositePlanetarySystems.csv?

(5 points)

E.12 E.2.4

Note that ridge regression has a much lower RMSE on the test datasets as compared to ordinary least squares (OLS) regression. Shrinking the coefficients has reduced the variance of the estimated coefficients with a small increase in bias, thereby improving the model fit. Appreciate it. Which are the top two predictors for which the coefficients have shrunk the most?

To answer this question, find the ridge regression estimates for λ = 10^(-10) (almost zero regularization). Treat these estimates as OLS estimates and find the predictors for which these estimates have shrunk the most as compared to the model developed in E.2.3.

(4 points for code, 1 point for answer)
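Given the two fitted coefficient vectors (near-zero λ vs. the tuned λ from E.2.3) on the same standardized predictors, the comparison is a one-liner; the names and numbers below are purely hypothetical placeholders:

```python
import numpy as np
import pandas as pd

# Hypothetical coefficient vectors: beta_ols from ridge with lambda ~ 0,
# beta_ridge from the tuned model
names = ['st_teff', 'sy_dist', 'pl_orbper', 'st_mass']
beta_ols = np.array([2.1, -0.9, 1.5, -3.0])
beta_ridge = np.array([0.4, -0.7, 1.2, -2.6])

# Shrinkage per predictor = drop in absolute magnitude; report the top two
shrinkage = pd.Series(np.abs(beta_ols) - np.abs(beta_ridge), index=names)
print(shrinkage.sort_values(ascending=False).head(2).index.tolist())
```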

E.13 E.2.5

Why do you think the coefficients of the two variables identified in the previous question shrunk the most?

(4 points for justification - including code used)

E.14 E.2.6

Develop a lasso model to predict pl_rade using all the variables in train_CompositePlanetarySystems.csv except pl_name, disc_facility and disc_locale. What is the optimal value of the tuning parameter λ?

Hint: You may use the following grid of lambda values to find the optimal λ: alphas = 10**np.linspace(0,-2.5,200)*0.5

(4 points)
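The fitting itself would again go through a cross-validated lasso over the hinted grid; for intuition on what the lasso penalty does, here is the soft-threshold operator, which is exactly the lasso solution when the predictors are orthonormal (a simplifying assumption, not the assignment's data):

```python
import numpy as np

def soft_threshold(z, lam):
    # Lasso's building block: shrink toward zero, and cut off at exactly zero
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# With orthonormal predictors, the lasso estimate is the soft-thresholded OLS
# estimate -- this is why lasso sets small coefficients exactly to 0
ols = np.array([3.0, -0.4, 0.05, -2.0])
for lam in (0.0, 0.5, 1.0):
    print(lam, soft_threshold(ols, lam))
```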

E.15 E.2.7

Use the optimal value of λ found in the previous question to develop a lasso model. What is the RMSE of the model on test1_CompositePlanetarySystems.csv and test2_CompositePlanetarySystems.csv?

(5 points)

E.16 E.2.8

Note that lasso has a much lower RMSE on the test datasets as compared to ordinary least squares (OLS) regression. Shrinking the coefficients has improved the model fit. Appreciate it. Which variables have been eliminated by lasso?

To answer this question, find the predictors whose coefficients are 0 in the lasso model.

(2 points for code, 1 point for answer)
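With a fitted lasso you would wrap the model's coefficient array in a Series indexed by feature names and filter for zeros; a sketch with hypothetical names and values:

```python
import pandas as pd

# Hypothetical fitted lasso coefficients, keyed by predictor name
coefs = pd.Series({'sy_dist': 0.8, 'pl_orbper': 0.0, 'st_teff': -0.3, 'sy_pnum': 0.0})

# Predictors eliminated by lasso are exactly those with a zero coefficient
eliminated = coefs[coefs == 0].index.tolist()
print(eliminated)
```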

E.17 E.3

We have used car_features_train.csv and car_prices_train.csv in class notes to predict car price. Based on correlation with price, the four most relevant continuous predictors of car price are year, mpg, mileage, and engineSize. In this question, you will use K-fold cross validation to find the relevant interactions of these predictors, and the relevant interactions of the polynomial transformations of these predictors, for predicting car price. We’ll consider quadratic and cubic transformations of each predictor, and the interactions of these transformed predictors. For example, some of the interaction terms that you will consider are (year^2), (year)(mpg), (year^2)(mpg), (year)(mpg)(mileage), (year)(mpg^2)(mileage), (year)(mpg^2)(mileage)(engineSize^3), etc. The highest degree interaction term will be of degree 12 - (year^3)(mpg^3)(mileage^3)(engineSize^3) - and the lowest degree interaction terms will be of degree two, such as (engineSize^2) or (engineSize)(mpg).

The algorithm to find the relevant interactions using K-fold cross validation is as follows. Most of the algorithm is already coded for you. You need to code only part 2, as indicated below.

  1. Start with considering interactions of degree d = 2:

  2. Find out the 5-fold cross validation RMSE if an interaction of degree d is added to the model (You need to code only this part).

  3. Repeat step (2) for all possible interactions of degree d.

  4. Include the interaction of degree d in the model that leads to the highest reduction in the 5-fold cross validation error as compared to the previous model (forward stepwise selection based on K-fold cross validation).

  5. Repeat steps 2-4 until no more reduction is possible in the 5-fold cross validation RMSE.

  6. If d = 12, then stop, otherwise increment d by one, and repeat steps 2-5.

The above algorithm is coded below. The algorithm calls a function KFoldCV to compute the 5-fold cross validation RMSE, given the interaction terms already selected in the model as selected_interactions and the interaction term to be tested as interaction_being_tested. The function must return the 5-fold cross validation RMSE if interaction_being_tested is included in the model consisting of year, mpg, mileage, and engineSize, in addition to the already added interactions in selected_interactions. The model equation for which you need to find the 5-fold cross validation RMSE is 'price~year+mpg+mileage+engineSize' + selected_interactions + interaction_being_tested.
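Since KFoldCV is the part you must implement yourself, the following is only a generic, numpy-only illustration of the 5-fold mechanics on synthetic data, not the statsmodels-formula-based function the assignment expects:

```python
import numpy as np

def kfold_rmse(X, y, k=5):
    # Mean RMSE over k folds: fit OLS on k-1 folds, score on the held-out fold
    n = len(y)
    idx = np.arange(n)                 # data assumed already shuffled
    rmses = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        A = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        At = np.column_stack([np.ones(len(fold)), X[fold]])
        rmses.append(np.sqrt(np.mean((y[fold] - At @ beta) ** 2)))
    return float(np.mean(rmses))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=100)  # noise has std 1
print(round(kfold_rmse(X, y), 3))
```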

You need to do the following:

  1. Fill out the function KFoldCV.

  2. Execute the code to obtain the relevant interactions in selected_interactions. Print out the object selected_interactions.

  3. Fit the model with the four predictors, the selected_interactions and compute the RMSE on test data.

Relevance of this question to the linear regression prediction problem: Once you figure out the four most useful predictors to predict money_made_inv, use this algorithm to find their useful interactions. Combine the model with the EDA-based insight of developing the model only where out_prncp_inv>0, and you should get an RMSE of less than 400 on the public leaderboard! (This algorithm is inspired by Victoria Shi’s solution to the linear regression prediction problem.)

Note that this brute-force approach of finding relevant interactions may work sometimes, especially when n >> p (the number of observations is much higher than the number of predictors). However, in certain problems, a few minutes spent on EDA to figure out transformations, etc., will help develop a better model than this brute-force approach. For example, EDA-based transformations helped you get an RMSE of less than $350k on question C.1.6, which is not possible with this approach.

(10 (filling out the function) + 1 + 1 points)

Code
import pandas as pd
import numpy as np
trainf = pd.read_csv('Car_features_train.csv')
trainp = pd.read_csv('Car_prices_train.csv')
testf = pd.read_csv('Car_features_test.csv')
testp = pd.read_csv('Car_prices_test.csv')
test = pd.merge(testf, testp)
train = pd.merge(trainf, trainp)
Code
# Creating a dataframe that will consist of all combinations of polynomial transformations of the 
# predictors to be considered for interactions

predictor_set = ['year','mpg','engineSize','mileage']
from itertools import product
values = np.arange(0,4)
polynomial_transformations = pd.DataFrame(product(values, repeat=4), columns=predictor_set).loc[1:,]
polynomial_transformations.loc[:,'sum_degree'] = (polynomial_transformations).astype(int).sum(axis=1)
polynomial_transformations.loc[:,'count_zeros'] = (polynomial_transformations == 0).astype(int).sum(axis=1)
polynomial_transformations.sort_values(by = ['count_zeros', 'sum_degree'], ascending=[False, True], inplace=True)
polynomial_transformations.drop(columns = ['count_zeros'], inplace=True)
polynomial_transformations.reset_index(inplace = True, drop = True)
Code
#Setting the seed as we are shuffling the data before splitting it into K-folds
np.random.seed(123)
# Shuffling the training set before creating K folds
train = train.sample(frac=1)
k = 5 #5-fold cross validation
fold_size = int(np.round(train.shape[0]/k)) # integer fold size, so it can be used for index slicing
Code
# Fill out this function - that is all you need to do to make the code work!

# The function must return the mean 5-fold cross validation RMSE for the model
# that has the 4 individual predictors - 'year', 'mpg', 'mileage', and 'engineSize',
# the 'selected_interactions', and the 'interaction_being_tested'

# Uncomment the lines below and fill the function
#def KFoldCV(selected_interactions, interaction_being_tested):
    
        # model = sm.ols('price~year+mpg+mileage+engineSize'+selected_interactions+\
        # interaction_being_tested, data = ...).fit()
        
    #return 
Code
# This code implements the algorithm of systematically considering interactions of degree 2 and going up to 
# the interaction of degree 12. For a given degree 'd' the interactions are selected greedily based on 
# highest reduction in the 5-fold cross validation RMSE. Once no more reduction in the 5-fold cross validation
# RMSE is possible using interactions of degree 'd', interaction terms of the next higher degree 'd+1' are considered.

# 5-fold cross validation RMSE of the initial model with the 4 predictors of degree one
cv_previous_model = KFoldCV(selected_interactions = '', interaction_being_tested = '')
interaction_being_tested = '+'
selected_interactions = ''

# Considering interactions of degree 'd' = 2 to 12
for d in np.arange(2,13):
    
    # Selecting interaction terms of degree = 'd'
    degree_set = polynomial_transformations.loc[polynomial_transformations.sum_degree==d, :]
    
    # Initializing objects to store the interactions of degree 'd' that reduce the
    # 5-fold cross validation RMSEs as compared to the previous model
    interactions_that_reduce_KfoldCV = []; cv_degree = []; 
    
    # Creating another DataFrame that will consist of the updated set of interactions of degree 'd' to be considered
    # as interactions that do not reduce the 5-fold cross validation RMSE will be discarded
    degree_set_updated = pd.DataFrame(columns = degree_set.columns)
    
    # Continue adding interactions of degree 'd' in the model until no interactions reduce 
    # the 5-fold cross-validation RMSE
    while True:
        
        #Iterating over all possible interactions of degree 'd'
        for index, row in degree_set.iterrows():
            
            # Creating the formula expression for the interaction term to be tested
            for predictor in predictor_set:
                interaction_being_tested = interaction_being_tested + ('I('+predictor +'**' +\
                                         str(row[predictor]) + ')*' if row[predictor]>1 else\
                                               predictor + '*' if row[predictor]==1 else '')
            interaction_being_tested = interaction_being_tested[:-1]
            
            # Call the function 'KFoldCV' to find out the 5-fold cross validation error on adding the 
            # interaction term being tested to the model
            cv = KFoldCV(selected_interactions, interaction_being_tested)
            
            # If the interaction term being tested reduces the 5-fold cross validation RMSE as compared to the
            # previous model, then consider adding it to the model
            if cv<cv_previous_model:
                interactions_that_reduce_KfoldCV.append(interaction_being_tested)
                cv_degree.append(cv)
                degree_set_updated = pd.concat([degree_set_updated, row.to_frame().T])
            interaction_being_tested = '+'
        cv_data = pd.DataFrame({'interaction':interactions_that_reduce_KfoldCV, 'cv':cv_degree})
        
        # Sort the interaction terms that reduce the 5-fold cross validation RMSE based on their respective
        # 5-fold cross validation RMSE
        cv_data.sort_values(by = 'cv', inplace = True)
        
        # Break the loop if no interaction of degree 'd' reduces the 5-fold cross validation RMSE as
        # compared to the previous model
        if cv_data.shape[0]==0:
            break
            
        # Select the interaction that corresponds to the least 5-fold cross validation RMSE
        selected_interactions = selected_interactions + cv_data.iloc[0,0]
        cv_previous_model = cv_data.iloc[0,1]
        cv_degree = []; interactions_that_reduce_KfoldCV = []
        degree_set = degree_set_updated.copy()
        degree_set_updated = pd.DataFrame(columns = degree_set.columns)
        
        # Print the progress after each model update, i.e., after an interaction term is selected
        print("Degree of interactions being considered:",d, ", 5-fold CV RMSE:", cv_previous_model)